Abstract. In this paper we model user behaviour in Twitter to capture
the emergence of trending topics. For this purpose, we first extensively
analyse tweet datasets of several different events. In particular, for these
datasets, we construct and investigate the retweet graphs. We find that
the retweet graph for a trending topic has a relatively dense largest
connected component (LCC). Next, based on the insights obtained from the
analyses of the datasets, we design a mathematical model that describes
the evolution of a retweet graph by three main parameters. We then
quantify, analytically and by simulation, the influence of the model
parameters on the basic characteristics of the retweet graph, such as the
density of edges and the size and density of the LCC. Finally, we put
the model into practice, estimate its parameters and compare the resulting
behaviour of the model to our datasets.

1 Introduction

Nowadays, social media play an important role in our society. The topics people
discuss online reflect what interests the community. Such trends may
have various origins and consequences: from reactions to real-world events and
naturally arising discussions to trends manipulated, e.g., by companies and
organisations [?]. Trending topics on Twitter are ‘ongoing’ topics that
suddenly become extremely popular. In our study, we want to reveal differences in the
retweet graph structure for different trends and model how these differences
arise.

* The work of Nelly Litvak is partially supported by the EU-FET Open grant NADINE
(288956).
In Twitter, users can post messages that consist of a maximum of 140
characters. These messages are called tweets. One can “follow” a user in Twitter,
which places that user's messages in one's own message display, called the timeline. Social
ties are directed in Twitter; thus, if user A follows user B, it does not imply that
B follows A. People who follow a user are called “followers” of this user, and the
people a user follows are called that user's “friends”. We refer
to the network of social ties in Twitter as the friend-follower network. Further,
one can forward a tweet of a user, which is called a retweet.

There have been many studies on detecting different types of trends, for
instance detecting emergencies [9], earthquakes [?], diseases [13] or important
events in sports [?]. In many current studies of trend behaviour, the focus
is mainly on the content of the messages that are part of the trend, see e.g. [12].
Our work focuses instead on the underlying networks describing the social ties
between users of Twitter. Specifically, we consider a graph of users, where an
edge means that one of the users has retweeted a message of a different user.

In this study we use several datasets of tweets on multiple topics. First we
analyse the datasets, described in Section 3, by constructing the retweet graphs
and obtaining their properties as discussed in Section 4. Next, we design a
mathematical model, presented in Section 5, that describes the growth of the retweet
graph. The model involves two attachment mechanisms. The first mechanism
is the preferential attachment mechanism that causes more popular messages
to be retweeted with a higher probability. The second mechanism is the
superstar mechanism, which ensures that a user that starts a new discussion receives
a finite fraction of all retweets in that discussion [2]. We quantify, analytically
and with simulations, the influence of the model parameters on its basic
characteristics, such as the density of edges and the size and density of the largest
connected component. In Section 6 we put the model into practice, estimate its
parameters and compare it to our datasets. We find that the model shows promise
for describing the retweet graphs of trending topics. We close
with conclusions and discussion in Section 7.


2 Related work 

The amount of literature regarding trend detection in Twitter is vast. The
overview we provide here is by no means complete. Many studies have been
performed to determine basic properties of the so-called “Twitterverse”. Kwak
et al. [?] analysed the follower distribution and found a non-power-law
distribution with a short effective diameter and a low reciprocity. Furthermore, they
found that ranking by the number of followers and ranking by PageRank induce
similar rankings. They also report that Twitter is mainly used for news (85% of
the content). Huberman et al. [?] found that the network of interactions within
Twitter is not equal to the follower network; it is much smaller.

An important part of trending behaviour in social media is the way these
trends progress through the network. Many studies have been performed on
Twitter data. For instance, [3] studies the diffusion of news items in Twitter for
several well-known news media and finds that these cascades follow a star-like
structure. Also, [20] investigates the diffusion of information on Twitter using
tweets on the Iranian election in 2009, and finds that cascades tend to be wide,
not too deep and follow a power-law distribution in their size.

Bhamidi et al. [2] proposed and validated on data a so-called superstar
random graph model for a giant component of a retweet graph. Their model is
based on the well-known preferential attachment idea, where users with many
retweets have a higher chance to be retweeted [1]; however, there is also a
superstar node that receives a new retweet at each step with a positive probability. We
build on this idea to develop our model for the progression of a trend through
the Twitter network.

Another perspective on the diffusion of information in social media is
obtained through analysing the content of messages. For example, [?] finds that on
Twitter, tags tend to travel to more distant parts of the network and URLs
travel shorter distances. Romero et al. [?] analyse the spread mechanics of
content through hashtag use and derive probabilities that users adopt a hashtag.

Classification of trends on Twitter has attracted considerable attention in
the literature. Zubiaga et al. [?] derive four different types of trends, using 15
features to make their distinction. They distinguish trends triggered by news,
current events, memes or commemorative tweets. Lehmann et al. [12] study
different patterns of hashtag trends in Twitter. They also observe four different
classes of hashtag trends. Rattanaritnont et al. [?] propose to distinguish topics
based on four factors, which are cascade ratio, tweet ratio, time of tweet and
patterns in topic-sensitive hashtags.

We extend the model of [2] by mathematically describing the growth of a
complete retweet graph. Our proposed model has two more parameters that
define the shape of the resulting graph, in particular, the size and the density of
its largest connected component. To the best of our knowledge, this is the first
attempt to classify trends using a random graph model rather than algorithmic
techniques or machine learning. The advantage of this approach is that it gives
insight into the emergence of the trend, which, in turn, is important for understanding
and predicting the potential impact of social media on real-world events.

3 Datasets 

We use datasets containing tweets that have been acquired either using the
Twitter Streaming API or the Twitter REST API (see https://dev.twitter.com/docs/api/1.1).
Using the REST API one
can obtain tweets or users from Twitter's databases. The Streaming API filters
tweets that Twitter parses during a day, for example, based on users, locations,
hashtags, or keywords.

Most of the datasets used in this study were scraped by RTreporter, a
company that uses an incoming stream of Dutch tweets to detect news for news
agencies in the Netherlands. These tweets are scraped based on keywords, using
the Streaming API. For this research, we selected several events that happened
in the period of data collection, based on the Wikipedia overviews of 2013 and
2014. We have also used two datasets scraped by TNO - Netherlands
Organisation for Applied Scientific Research. The Project X dataset contains tweets
related to large riots in Haren, the Netherlands. This dataset was acquired by
Twitcident. For this study, we have filtered this dataset on the two most important
hashtags: #projectx and #projectxharen. The Turkish-Kurdish dataset is
described in more detail in Bouma et al. [?]. A complete overview of the datasets,
including the events and the keywords, is given in Table 1. The size and the
timespans for each dataset are given in Table 2.



dataset   event                                         keywords
PX        Project X Haren                               projectx, projectxharen
TK        Demonstrations in Amsterdam related to        koerden, turken, rellen, museumplein, amsterdam
          the Turkish-Kurdish conflict
WCS       World cup speedskating single distances 2013  wkafstanden, sochi, sotsji
W-A       Crowning of His Majesty King                  troonswisseling, troon, Willem-Alexander, Wim-Lex, Beatrix, koning, koningin
          Willem-Alexander in the Netherlands
ESF       Eurovision Song Festival                      esf, Eurovisie Songfestival, ESF, songfestival, eurovisie
CL        Champions League final 2013                   Bayern Munchen, Borussia Dortmund, dorbay, borussia, bayern, borbay, CL
Morsi     Morsi deposed as Egyptian president           Morsi, afgezet, Egypte
Train     Train crash in Santiago, Spain                Treincrash, treincrash, Santiago, Spanje, Santiago de Compostella, trein
Heat      Heat wave in the Netherlands                  hittegolf, Nederland
Damascus  Sarin attack in Damascus                      Sarin, Damascus, Syrie, syrie
Peshawar  Bombing in Peshawar                           Peshawar, kerk, zelfmoordaanslag, Pakistan
Hawk      Hawk spotted in the Netherlands               sperweruil, Zwolle
Pile-up   Multiple pile-ups in Belgium on the A19       A19, Ieper, Kortrijk, kettingbotsing
Schumi    Michael Schumacher has a skiing accident      Michael Schumacher, ski-ongeval
UKR       Rebellion in Ukraine                          Azarov, Euromaidan, Euromajdan, Oekraïne, opstand
NAM       Treaty between NAM and Dutch government       Loppersum, gasakkoord, NAM, Groningen
WCD       Michael van Gerwen wins PDC WC Darts          van Gerwen, PDC, WK Darts
NSS       Nuclear Security Summit 2014                  NSS2014, NSS, Nuclear Security Summit 2014, Den Haag
MH730     Flight MH730 disappears                       MH730, Malaysia Airlines
Crimea    Crimea referendum for independence            Krim, referendum, onafhankelijkheid
Kingsday  First Kingsday in the Netherlands             koningsdag, kingsday, koningsdag
Volkert   Volkert van der Graaf released from prison    Volkert, volkertvandergraaf, Volkert van der Graaf

Table 1. Datasets: events and keywords (some keywords are in Dutch).


For each dataset we have observed that there is at least one large peak in the
progression of the number of tweets. For example, Figure 1 shows such a peak in
Twitter activity for the Project X dataset.

s http://nl.wikipedia.org/wiki/2014 & http://nl.wikipedia.org/wiki/2013 
dataset   year     first tweet      last tweet       # tweets  # retweets
PX        2012     Sep 17 09:37:18  Sep 26 02:31:15  31,144    15,357
TK        2011     Oct 19 14:03:23  Oct 27 08:42:18  6,099     999
WCS       2013     Mar 21 09:19:06  Mar 25 08:45:50  2,182     311
W-A       2013     Apr 27 22:59:59  May 02 22:59:25  352,157   88,594
ESF       2013     May 13 23:00:08  May 18 22:59:59  318,652   82,968
CL        2013     May 22 23:00:04  May 26 22:59:54  163,612   54,471
Morsi     2013     Jun 30 23:00:00  Jul 04 22:59:23  40,737    13,098
Train     2013     Jul 23 23:00:02  Jul 30 22:59:41  113,375   26,534
Heat      2013     Jul 10 19:44:35  Jul 29 22:59:58  173,286   42,835
Damascus  2013     Aug 20 23:01:57  Aug 31 22:59:54  39,377    11,492
Peshawar  2013     Sep 21 23:00:00  Sep 24 22:59:59  18,242    5,323
Hawk      2013     Nov 11 23:00:07  Nov 30 22:58:59  54,970    19,817
Pile-up   2013     Dec 02 23:00:15  Dec 04 22:59:57  6,157     2,254
Schumi    2013-14  Dec 29 02:43:16  Jan 01 22:54:50  13,011    5,661
UKR       2014     Jan 26 23:00:36  Jan 31 22:57:12  4,249     1,724
NAM       2014     Jan 16 23:00:22  Jan 20 22:59:49  41,486    14,699
WCD       2013-14  Dec 31 23:03:48  Jan 02 22:59:05  15,268    5,900
NSS       2014     Mar 23 23:00:06  Mar 24 22:59:56  29,175    13,042
MH730     2014     Mar 08 00:18:32  Mar 28 22:40:44  36,765    17,940
Crimea    2014     Mar 13 23:02:22  Mar 17 22:59:57  18,750    5,881
Kingsday  2014     Apr 26 23:00:00  Apr 29 22:53:00  7,576     2,144
Volkert   2014     Apr 30 23:08:14  May 04 22:57:06  9,659     4,214

Table 2. Characteristics of the datasets.
Fig. 1. Project X: number of tweets and cumulative number of tweets per hour.


When a retweet is placed on Twitter, the Streaming API returns the retweet
together with the message that has been retweeted. We use this information to
construct the retweet trees of every message and the user IDs for each posted
message. The tweet and graph analysis is done using Python and its modules
Tweepy and NetworkX. In this paper, we investigate the dynamics of retweet
graphs with the goal of predicting peaks in Twitter activity and classifying the nature
of trends.


10 http://www.tweepy.org/ 

11 http://networkx.github.io/ 




















4 Retweet graphs 


Our main object of study is the retweet graph $G = (V, E)$, which is a graph
of users that have participated in the discussion on a specific topic. A directed
edge $e = (u, v)$ indicates that user $v$ has retweeted a tweet of $u$. We observe
the retweet graph at the time instances $t = 0, 1, 2, \ldots$, where either a new node
or a new edge was added to the graph, and we denote by $G_t = (V_t, E_t)$ the
retweet graph at time $t$. As usual, the out- (in-) degree of node $u$ is the number
of directed edges with source (destination) in $u$. In what follows, we model and
analyse the properties of $G_t$. For every new message initiated by a new user $u$,
a tree $T_u$ is formed. Then, $\mathcal{T}_t$ denotes the forest of message trees. Note that in
our model a new message from an already existing user $u$ (that is, $u \in \mathcal{T}_t$) does
not initiate a new message tree. We define $|\mathcal{T}_t|$ as the number of new users that
have started a message tree up to time $t$.
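
As an illustration of these definitions, the following minimal sketch computes the basic quantities of a retweet graph from a list of retweet pairs. It is not the authors' code: the function name `lcc_stats` and the input format are assumptions made here, duplicate retweets between the same pair of users and self-retweets are ignored, and weakly connected components are found with a plain union-find instead of a graph library.

```python
from collections import defaultdict


def lcc_stats(edges, nodes=()):
    """Return (|V|, |E|, |V_LCC|, |E_LCC|) for a directed retweet graph.

    edges: iterable of (u, v) pairs meaning "v retweeted u".
    nodes: optional extra users who tweeted but were never retweeted.
    """
    # Assumption: simple graph, so repeated pairs and self-loops are dropped.
    edge_set = {(u, v) for u, v in edges if u != v}
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for n in nodes:
        find(n)
    for u, v in edge_set:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # edge direction is ignored: weak connectivity

    comp_nodes = defaultdict(int)
    for n in list(parent):
        comp_nodes[find(n)] += 1
    comp_edges = defaultdict(int)
    for u, _ in edge_set:
        comp_edges[find(u)] += 1
    if not comp_nodes:
        return 0, 0, 0, 0
    root = max(comp_nodes, key=comp_nodes.get)
    return len(parent), len(edge_set), comp_nodes[root], comp_edges[root]


# Hypothetical toy data: users a, b, c form one component, d and e another,
# and f posted a tweet that nobody retweeted.
n_v, n_e, n_v_lcc, n_e_lcc = lcc_stats(
    [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")], nodes=["f"]
)
```

The edge density of the LCC is then `n_e_lcc / n_v_lcc` and its relative size is `n_v_lcc / n_v`.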

After analyzing multiple characteristics of the retweet graphs for every hour
of their progression, we found that the size of the largest (weakly) connected
component (LCC) and its density are the most informative characteristics for
predicting the peak in Twitter activity. In Figure 2 we show the development of these
characteristics in the Project X dataset. One day before the actual event, we
observe a very interesting phenomenon in the development of the edge density
of the LCC in Figure 2a. Namely, at some point the edge density of the LCC
exceeds 1 (indicated by the dash-dotted gray lines), i.e. there is more than one
retweet per user on average. We shall refer to this as the densification (or dens.)
of the LCC. Furthermore, the relative size of the LCC increases from 18% to
25% as well, see Figure 2b.



(a) Edge density. (b) Size of LCC.

Fig. 2. Progression of the edge density (a) and the size of the LCC (b) in the Project
X dataset.


We have observed a densification of the LCC in each dataset that we have
studied. Indeed, when the LCC grows its density must become close to one or larger (each
node is added to the LCC together with at least one edge). However, we have also
observed that in each dataset the densification occurs before the main peak, but
the scale of densification is different. For example, in the Project X dataset the
densification already occurs one day before the peak activity. Plausibly, in this
discussion, which ended in riots, a group of people was actively participating
before the event. On the other hand, in the WCS dataset, which contains tweets about
an ongoing sports event, the densification of the LCC occurs during the largest
peak. This is the third peak in the progression. Hence, our experiments suggest
that the time of densification has predictive value for trend progression and
classification. See Table 5 for the density of the LCC in each dataset at the end
of the progression.
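
The moment of densification can be located by replaying the retweets in time order and tracking the largest weakly connected component. The sketch below is only an illustration under stated assumptions: the function name `first_densification` is hypothetical, the input is a time-ordered list of (retweeted user, retweeting user) pairs, repeated pairs and self-retweets are skipped, and "densification" is read as the first step at which the LCC has more edges than nodes.

```python
def first_densification(retweets):
    """Return the index of the first retweet after which the largest weakly
    connected component has more edges than nodes, or None if that never happens.

    retweets: time-ordered iterable of (u, v) pairs, "v retweeted u".
    """
    parent, size, n_edges = {}, {}, {}
    seen = set()
    best = None  # root of the current largest component

    def find(x):
        if x not in parent:
            parent[x], size[x], n_edges[x] = x, 1, 0
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for step, (u, v) in enumerate(retweets):
        if u == v or (u, v) in seen:
            continue  # adds neither a node nor an edge
        seen.add((u, v))
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru  # merge the smaller component into the larger one
            size[ru] += size[rv]
            n_edges[ru] += n_edges[rv]
        n_edges[ru] += 1
        best = ru if best is None else find(best)
        if size[ru] > size[best]:
            best = ru
        if n_edges[best] > size[best]:
            return step
    return None


# Hypothetical toy stream: the fourth retweet gives 4 edges on 3 users.
step = first_densification([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
```

Applied per hour rather than per retweet, the same bookkeeping yields curves of the kind described for the Project X dataset.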


5 Model 

Our goal is to design a model that captures the development of trending
behaviour. In particular, we need to capture the phenomenon that disjoint
components of the retweet graph join together forming the largest component, of
which the density of edges may become larger than one. To this end, we employ
the superstar model of Bhamidi et al. [2] for modelling distinct components of
the retweet graph, and add the mechanism for new components to arrive and
the existing components to merge. For the sake of simplicity of the model we
neglect the friend-follower network of Twitter. Note that in Twitter every user can
retweet any message sent by any public user, which supports our simplification.

At the start of the progression, we have the graph $G_0$. In the analysis of this
section, we assume that $G_0$ consists of a single node. Note that in reality, this
does not need to be the case: any directed graph can be used as an input graph
$G_0$. In fact, in Section 6 we start with the actual retweet graph at a given point
in time, and then use the model to build the graph further to its final size.

We consider the evolution of the retweet graph in time, $(G_t)_{t \ge 0}$. We use a
subscript $t$ to indicate $G_t$ and related notions at time $t$. We omit the index $t$
when referring to the graph at the end of the progression.

Recall that $G_t$ is a graph of users, and an edge $(u, v)$ means that $v$ has
retweeted a tweet of $u$. We consider time instances $t = 1, 2, \ldots$ when either a
new node or a new edge is added to the graph $G_{t-1}$. We distinguish three types
of changes in the retweet graph:

- T1: a new user $u$ has posted a new message on the topic; node $u$ is added to $G_{t-1}$;
- T2: a new user $v$ has retweeted an existing user $u$; node $v$ and edge $(u, v)$ are added to $G_{t-1}$;
- T3: an existing user $v$ has retweeted another existing user $u$; edge $(u, v)$ is added to $G_{t-1}$.

The initial node is equivalent to a T1 arrival at time $t = 0$. Assume that each
change in $G_t$ at $t = 1, 2, \ldots$ is T1 with probability $\lambda/(1 + \lambda)$, independently of
the past. Also, assume that a new edge (retweet) is coming from a new user with
probability $p$. Then the probabilities of T1, T2 and T3 arrivals are, respectively,
$\frac{\lambda}{\lambda+1}$, $\frac{p}{\lambda+1}$ and $\frac{1-p}{\lambda+1}$. The parameter $p$ governs the process of components merging
together, while $\lambda$ governs the arrival of new components in the graph.

For both T2 and T3 arrivals we define the same mechanism for choosing the
source of the new edge $(u, v)$ as follows.

Let $u_0, u_1, \ldots$ be the users that have been added to the graph as T1 arrivals,
where $u_0$ is the initial node. Denote by $T_{i,t}$ the subgraph of $G_t$ that includes $u_i$
and all users that have retweeted the message of $u_i$ in the interval $(0, t)$. We call
such a subgraph a message tree with root $u_i$. We assume that a T2 or T3 arrival
at time $t$ attaches an edge to one of the nodes in $T_{i,t-1}$
with probability $p_{T_{i,t-1}}$, proportional to the size of the message tree:

\[ p_{T_{i,t-1}} = \frac{|T_{i,t-1}|}{\sum_{j} |T_{j,t-1}|}. \]

This creates a preferential attachment mechanism in the formation of the message
trees. Next, a node in the selected message tree $T_{i,t-1}$ is chosen as the source
node following the superstar attachment scheme [2]: with probability $q$, the new
retweet is attached to $u_i$, and with probability $1 - q$, the new retweet is attached
to any other vertex, proportional to the preferential attachment function of the
node, which we choose to be the number of children of the node plus one.

Thus we employ the superstar model, which was suggested in [2] for modelling
the largest connected component of the retweet graph on a given topic, in order
to describe a progression mechanism for a single retweet tree. Our extensions
compared to [2] are that we allow new message trees to appear (T1 arrivals),
and that different message trees may either remain disconnected or get connected
by a T3 arrival.

For a T3 arrival, the target of the new edge $(u, v)$ is chosen uniformly at
random from $V_{t-1}$, with the exception of the earlier chosen source node $u$, to
prevent self-loops. That is, any user is equally likely to retweet a message from
another existing user.

Note that, in our setting, it is easy to introduce a different superstar
parameter $q_{T_i}$ for every message tree $T_i$. This way one could easily implement specific
properties of the user that starts the message tree, e.g. his/her number of
followers. For the sake of simplicity, we choose the same value of $q$ for all message
trees. Also note that we do not include tweets and retweets that do not result
in new nodes or edges in a retweet graph. This could be done, for example, by
introducing dynamic weights of vertices and edges that increase with new tweets
and retweets. Here we consider only an unweighted model.
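
The growth mechanism described in this section can be sketched as a short simulation. This is an illustrative reading of the model, not the authors' implementation, and it rests on several assumptions that are not stated in the text: a node stays in the message tree it joined on arrival (a T3 arrival does not change tree membership), a tree that consists of its root only always returns the root as source, a T3 arrival while the graph has a single node is treated as a T2 arrival, and parallel edges are not filtered out. The function name `simulate_retweet_graph` is hypothetical.

```python
import random


def simulate_retweet_graph(lam, p, q, steps, seed=None):
    """Grow a retweet graph for `steps` arrivals, starting from a single node.

    Returns (number of nodes, list of directed edges (u, v), number of message trees).
    """
    rng = random.Random(seed)
    trees = [[0]]    # trees[i] lists the nodes of message tree i; trees[i][0] is its root
    children = [0]   # children[v] = number of retweets received by node v
    edges = []       # (u, v) means "v retweeted u"
    n_nodes = 1

    def pick_source():
        # Message tree chosen with probability proportional to its size.
        tree = rng.choices(trees, weights=[len(t) for t in trees])[0]
        if len(tree) == 1 or rng.random() < q:
            return tree, tree[0]  # superstar: the root of the tree
        others = tree[1:]
        # Preferential attachment among non-root nodes: weight = children + 1.
        weights = [children[v] + 1 for v in others]
        return tree, rng.choices(others, weights=weights)[0]

    for _ in range(steps):
        if rng.random() < lam / (lam + 1.0):
            # T1: a new user starts a new message tree.
            trees.append([n_nodes])
            children.append(0)
            n_nodes += 1
            continue
        tree, u = pick_source()
        if n_nodes == 1 or rng.random() < p:
            # T2: a new user v retweets u and joins the tree of u.
            v = n_nodes
            n_nodes += 1
            children.append(0)
            tree.append(v)
        else:
            # T3: an existing user v != u, chosen uniformly, retweets u.
            v = rng.randrange(n_nodes - 1)
            if v >= u:
                v += 1
        edges.append((u, v))
        children[u] += 1
    return n_nodes, edges, len(trees)


# Example run with illustrative parameter values.
n_nodes, edges, n_trees = simulate_retweet_graph(lam=0.3, p=0.7, q=0.9, steps=1000, seed=1)
```

Each arrival adds either one message tree or one edge, so `len(edges) + n_trees - 1` always equals the number of steps, which mirrors the counting argument used for the model.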


5.1 Growth of the graph 

The average degree, or edge density, is one of the aspects through which we gain
insight into the growth of the graph. The essential properties of this characteristic
are presented in Theorem 1. The proof is given in the Appendix.
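Theorem 1 Let $\tau_n$ be the time when node $n$ is added to the graph. Then

\[ E\left[\frac{|E_{\tau_n}|}{|V_{\tau_n}|}\right] = \frac{1}{\lambda + p}\left(1 - \frac{1}{n}\right), \tag{1} \]

\[ \mathrm{Var}\left(\frac{|E_{\tau_n}|}{|V_{\tau_n}|}\right) = \frac{(n-1)(\lambda + 1 - p)}{n^2 (\lambda + p)^2}. \tag{2} \]

Note that the variance of the average degree in (2) converges to zero as
$n \to \infty$ at rate $1/n$.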
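
The behaviour of the average degree can be checked numerically without building the graph, since only the arrival types matter. The sketch below is an assumption-laden illustration: it takes the mean edge density at the moment the $n$-th node arrives to be $(1 - 1/n)/(\lambda + p)$, which is what the counting argument in the Appendix yields (a binomial number of T2 arrivals plus geometric numbers of T3 arrivals between node additions), and compares it with a Monte Carlo average. All function names are hypothetical.

```python
import random


def edge_density_at_node_n(lam, p, n, rng):
    """Simulate arrival types until the graph has n nodes; return |E| / |V|."""
    nodes, edges = 1, 0  # G_0 is a single node
    while nodes < n:
        if rng.random() < lam / (lam + 1.0):
            nodes += 1   # T1: new node, no edge
        elif rng.random() < p:
            nodes += 1   # T2: new node and new edge
            edges += 1
        else:
            edges += 1   # T3: new edge only
    return edges / nodes


def mean_edge_density(lam, p, n, runs=2000, seed=0):
    """Monte Carlo average of the edge density over independent runs."""
    rng = random.Random(seed)
    return sum(edge_density_at_node_n(lam, p, n, rng) for _ in range(runs)) / runs


def predicted_edge_density(lam, p, n):
    """Closed form assumed here: (1 - 1/n) / (lambda + p)."""
    return (1.0 - 1.0 / n) / (lam + p)


simulated = mean_edge_density(0.3, 0.7, 200)
predicted = predicted_edge_density(0.3, 0.7, 200)
```

For large $n$ the prediction approaches $1/(\lambda + p)$, so larger values of either parameter give a sparser graph.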

The next theorem studies the observed ratio between T2 and T3 arrivals
(new edges) and T1 arrivals (new nodes with a new message). As we see from
the theorem, this ratio can be used for estimating the parameter $\lambda$. The proof
is given in the Appendix.


Theorem 2 Let $G_t = (V_t, E_t)$ be the retweet graph at time $t$, and let $\mathcal{T}_t$ be the set
of all message trees in $G_t$. Then
where $Z$ is a standard normal $N(0,1)$ random variable, and $\xrightarrow{d}$ denotes
convergence in distribution.


Note that, as expected from the definition of $\lambda$,
the ratio $|E_t|/|\mathcal{T}_t|$ converges to $\lambda^{-1}$ as $t \to \infty$. This will be used in Section 6 for estimating $\lambda$.
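5.2 Component size distribution

In the following, we assume that $G_t$ consists of $m$ connected components
$(C_1, C_2, \ldots, C_m)$ with known respective sizes $(|C_1|, \ldots, |C_m|)$. We aim to derive
expressions for the distribution of the component sizes in $G_{t+1}$.


Lemma 3 The distribution of the sizes of the components of $G_{t+1}$, given $G_t$, is
as follows:

\[
\begin{cases}
(|C_1|, \ldots, |C_i|, |C_j|, \ldots, |C_m|, 1) & \text{w.p. } \frac{\lambda}{\lambda+1}, \\
(|C_1|, \ldots, |C_i| + 1, |C_j|, \ldots, |C_m|) & \text{w.p. } \frac{p}{\lambda+1} \cdot \frac{|C_i|}{|V_t|}, \\
(|C_1|, \ldots, |C_i| + |C_j|, \ldots, |C_m|) & \text{w.p. } \frac{1-p}{\lambda+1} \cdot \frac{2|C_i||C_j|}{|V_t|^2 - |V_t|}, \\
(|C_1|, \ldots, |C_i|, |C_j|, \ldots, |C_m|) & \text{w.p. } \frac{1-p}{\lambda+1} \cdot \frac{\sum_{k=1}^{m} |C_k|(|C_k| - 1)}{|V_t|^2 - |V_t|}.
\end{cases}
\]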
The proof of Lemma 3 is given in the Appendix. Lemma 3 provides a recursion
for computing the distribution of component sizes. However, the computations
are highly demanding, if not infeasible. Also, deriving an exact expression for the
distribution of the component sizes at time $t$ is very cumbersome because the sizes
strongly depend on the events that occurred at times $0, \ldots, t - 1$.
Note that if $p = 1$, there is a direct correspondence between our model and
the infinite generalized Pólya process [5]. However, this case is uninformative as
there are no T3 arrivals. Therefore, in the next section we resort to simulations to
investigate the sensitivity of the graph characteristics to the model parameters.


5.3 Influence of $q$, $p$ and $\lambda$

We analyze the influence of the model parameters $\lambda$, $p$ and $q$ on the
characteristics of the resulting graph numerically using simulations. To this end, we fix
two out of three parameters and execute multiple simulation runs of the model,
varying the values for the third parameter. We start simulations with graph $G_0$,
consisting of one node. We perform 50 simulation runs for every parameter
setting and obtain the average values over the individual runs for given parameters.

Parameter $q$ affects the degree distribution [2] and the overall structure of
the graph. If $q = 0$, then the graph contains fewer nodes that have many retweets.
If $q = 1$, each edge is connected to a superstar, and the graph consists of star-like
subgraphs, some of which are connected to each other. In the Project X dataset,
which is our main case study, $q \approx 0.9$ results in a degree distribution that closely
approximates the data. Since degree distributions are not in the scope of this
paper, we omit these results for brevity.

We compare the results for two characteristics that proved especially informative
for the Project X dataset: the edge density and the fraction of nodes in the LCC. These characteristics
do not depend on $q$. In simulations, we set $t = 1{,}000$, $q = 0.9$ and vary the values
for $p$ and $\lambda$. The results are given in Figure 3.
Fig. 3. Numerical results for the model using q = 0.9 and t = 1,000.

We see that the edge density in the LCC in Figure 3a decreases with $\lambda$ and
$p$. Note that, according to Theorem 1, $|E|/|V|$ is well approximated by $1/(\lambda + p)$ when
$\lambda$ or $p$ are large enough. The edge density in the LCC shows a similar pattern, but
it is slightly higher than in the whole graph. When $\lambda$ and $p$ are small, there are
many T3 arrivals, and new nodes are not added frequently enough. This results
in an unexpected non-monotonic behaviour of the edge density near the origin.
For the fraction of nodes in the LCC, depicted in Figure 3b, we see that the
parameter $\lambda$ is most influential. The parameter $p$ is of considerable influence
only when it is large.


6 The model in practice 


In this section we obtain parameter estimators for our model and compare the
model to the datasets discussed in Section 3.

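Using Theorem 2 we know that $|E_t|/|\mathcal{T}_t|$ converges to $\lambda^{-1}$ as $t \to \infty$. Thus, we
suggest the following estimator for $\lambda$ at time $t > 0$:

\[ \hat{\lambda}_t = \frac{|\mathcal{T}_t|}{|E_t|}. \tag{8} \]

Second, we derive an expression for $\hat{p}_t$ using (1) and substituting (8) for $\lambda$:

\[ \hat{p}_t = \frac{|V_t| - |\mathcal{T}_t| - 1}{|E_t|}. \tag{9} \]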

Since the Twitter API only returns the original message of a retweet and
not the level in the progression tree of that retweet, we cannot determine $q$
easily from the data. Since this parameter does not have a large influence on the
outcomes of the simulations, we choose this parameter to be 0.9 for all datasets.

Notice that we can obtain the numbers $|E_t|$, $|\mathcal{T}_t|$ and $|V_t|$ directly from
a given retweet graph for each $t = 1, 2, \ldots$. The computed estimators for our
datasets are displayed in Table 5.
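
The estimators only need three counts from the observed retweet graph. The following minimal sketch assumes the estimators take the form $\hat{\lambda}_t = |\mathcal{T}_t|/|E_t|$ and $\hat{p}_t = (|V_t| - |\mathcal{T}_t| - 1)/|E_t|$, which is how the surrounding text reads; the function name `estimate_parameters` and the example counts are hypothetical.

```python
def estimate_parameters(n_nodes, n_edges, n_trees):
    """Return (lambda_hat, p_hat) from |V_t|, |E_t| and the number of message trees."""
    if n_edges <= 0:
        raise ValueError("at least one edge (retweet) is needed")
    lam_hat = n_trees / n_edges                 # new message trees per retweet edge
    p_hat = (n_nodes - n_trees - 1) / n_edges   # assumed form of the estimator for p
    return lam_hat, p_hat


# Hypothetical counts: 100 users, 120 retweet edges, 30 message trees.
lam_hat, p_hat = estimate_parameters(n_nodes=100, n_edges=120, n_trees=30)
```

With these counts the sum of the two estimates equals $(|V_t| - 1)/|E_t|$, the reciprocal of the approximate edge density.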

Next, we run 50 simulations for each dataset, starting from the point of
densification of the LCC and continuing until the graph has reached the same size as the actual dataset.
We display the average outcomes of these simulations and compare them to the
actual properties of the retweet graphs of each dataset in Table 5.

The quality of the simulation results differs per dataset. For the CL, Morsi
and WCD datasets, the simulations are very similar to the actual progressions.
However, for some datasets, for instance the ESF dataset, simulations are far
off. In general, the model predicts the density of the LCC quite well for many
datasets, but tends to overestimate the size of the LCC. We notice that current
random graph models for networks usually capture one or two essential features,
such as degree distribution, self-similarity, clustering coefficient or diameter. Our
model captures both degree distribution and, in many cases, the density of the
LCC. It seems that our model performs better on the datasets that have a
singular peak rather than a series of peaks. We have observed on the data that
each peak activity has a large impact on the parameter estimation. We will
strive to adapt the model to incorporate different rules for activity during
peaks, and to improve results on the size of the LCC.

dataset     λ     p     actual progression                       simulations (starting at dens.)
                        |V_LCC|/|V|  |E|/|V|  |E_LCC|/|V_LCC|    |V_LCC|/|V|  |E|/|V|  |E_LCC|/|V_LCC|
PX          .23   .78   .76          1.00     1.12               .54          .75      1.08
TK          .42   .85   .25          .79      1.00               .54          .74      1.08
WCS         .49   .73   .20          .81      .99                .49          .95      1.90
W-A         .41   .52   .67          1.07     1.30               .40          .62      1.41
ESF         .38   .43   .73          1.24     1.48               .45          .69      1.42
CL          .40   .72   .44          .90      1.22               .46          .66      1.16
Morsi       .60   .55   .39          .87      1.20               .47          .67      1.17
Train       .54   .78   .28          .76      1.04               .50          .70      1.17
Heat        .42   .59   .60          .99      1.23               .41          .72      1.68
Damascus    .58   .51   .46          .92      1.24               .44          .65      1.30
Peshawar    .54   .68   .31          .82      1.18               .53          .75      1.25
Hawk        .38   .38   .82          1.31     1.45               .49          .76      1.43
Pile-up     .33   .64   .65          1.03     1.24               .58          .93      1.54
Schumi      .38   .83   .33          .82      1.08               .56          .77      1.07
UKR         .72   .37   .53          .91      1.12               .50          .75      1.38
NAM         .44   .48   .50          1.09     1.51               .45          .72      1.51
WCD         .26   .81   .66          .94      1.10               .64          .83      1.07
NSS         .26   .62   .79          1.13     1.26               .23          .35      1.21
MH730       .33   .52   .15          1.18     1.00               .56          .76      1.09
Crimea      .44   .63   .51          .93      1.19               .52          .72      1.12
Kingsday    .47   .92   .07          .72      1.11               .47          .67      1.15
Volkert     .29   .55   .79          1.18     1.31               .64          .87      1.22

Table 5. Estimated parameter values using complete dataset, simulation and
progression properties.

7 Conclusion and Discussion 

We have found that our model performs well in modelling the retweet graph
for tweets regarding a singular topic. However, there is room for improvement
when the dataset covers a prolonged discussion with user activity fluctuating
over time.

A possible extension of the present work is incorporating the
time aspect more explicitly into our model. We could for example add the notion of ‘novelty’,
like Gomez et al. in [8], taking into account that e.g. the retweet probability for a
user may decrease the longer he/she remains silent after having received a tweet.
Other model parameters may also be assumed to vary over time. In addition,
we propose to analyse the clustering coefficient of a node in the network model
and, in particular, to investigate how it evolves over time. This measure (see
[19]) provides more detailed insight into how the graph becomes denser, making it
possible to distinguish between local and global density.
Appendix

A1. Proof of Theorem 1

Proof. The proof is based on the fact that the total number of edges $|E_{\tau_n}|$ equals
the total number of T2 and T3 arrivals on $(0, \tau_n]$. By definition, $(0, \tau_n]$ contains
exactly $n - 1$ arrivals of type T1 or T2; hence, the number of T2 arrivals has a
binomial distribution with number of trials equal to $n - 1$ and success probability
$P(\mathrm{T2} \mid \mathrm{T1\ or\ T2}) = \frac{p}{\lambda + p}$. Next, the number of T3 arrivals on $[\tau_i, \tau_{i+1})$, where
$i = 1, \ldots, n - 1$, has a shifted geometric distribution, namely, the probability of
$k$ T3 arrivals on $[\tau_i, \tau_{i+1})$ is

\[ \left(1 - \frac{1-p}{\lambda+1}\right)\left(\frac{1-p}{\lambda+1}\right)^{k}, \quad k = 0, 1, \ldots. \]

Observe that there have been $n - 1$ of these transitions from 1 node to $n$. Hence,
the number of T3 arrivals on $(0, \tau_n]$ is the sum of $n - 1$ i.i.d. geometric random
variables with mean $\frac{1-p}{\lambda+p}$. Summarizing the above, we obtain (1). For (2) we also
need to observe that the numbers of T2 and T3 arrivals on $[0, \tau_n]$ are independent.


A2. Proof of Theorem 2

Proof. Let $X_t$ be the number of T2 and T3 arrivals by time $t$. Note that $|E_t| = X_t$,
and $|\mathcal{T}_t| = t - X_t + 1$, which is the number of T1 arrivals on $[0, t]$, since
the first node at time $t = 0$ is by definition a T1 arrival. Note that $X_t$ has a
binomial distribution with parameters $t$ and $P(\text{T2 arrival}) + P(\text{T3 arrival}) =
\frac{1}{\lambda+1}$. Hence,














which proves (3). Next, we write
where $\mathbb{1}\{A\}$ is an indicator of event $A$. Denoting

\[ Z_t = \frac{X_t - E[X_t]}{\sqrt{\mathrm{Var}(X_t)}} = \frac{(\lambda + 1) X_t - t}{\sqrt{\lambda t}}, \tag{11} \]


we further write 

^ i + 1 
$t - X_t$




= E 


(i + 1)(A + 1) . 

Ml - -Sh) ' 


L {z t <V Xt-^1} 


We now split the indicator above as follows: 

z t <-Vxt} + i {-Vxt<z t <Vxt/ 2 } + 1 {V\t/ 2 <z t <V\t-^±y 
For the first and the third term we use the Chernoff bound, which gives the bounds
(12)-(15), of order $2e^{-\lambda t/4}$ and $2e^{-\lambda t/16}$ respectively,
and notice that both expressions converge to zero faster than $1/t$. For
the second case, note first that $E[Z_t] = 0$, and hence it follows from (??) and
(??)-(15) that, as $t \to \infty$,

\[ E\left[Z_t \mathbb{1}\{-\sqrt{\lambda t} < Z_t < \sqrt{\lambda t}/2\}\right] \to 0. \]
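Then we use the Taylor expansion to obtain:

\[ E\left[\left|\frac{1}{1 - Z_t/\sqrt{\lambda t}} - 1 - \frac{Z_t}{\sqrt{\lambda t}}\right| \mathbb{1}\{-\sqrt{\lambda t} < Z_t < \sqrt{\lambda t}/2\}\right] \le \frac{E[Z_t^2]}{\lambda t} + \frac{2 E[|Z_t|^3]}{(\lambda t)^{3/2}} = O\left(\frac{1}{t}\right) \tag{16} \]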
as $t \to \infty$. By the central limit theorem, $Z_t \xrightarrow{d} Z$ as $t \to \infty$. Furthermore,
for $r > 0$, the convergence of moments holds [?]: $\lim_{t \to \infty} E[|Z_t|^r] = E[|Z|^r]$. In
particular, in (16), $E[|Z_t|^3]$ converges to a constant, and $E[Z_t^2]$ converges to 1
as $t \to \infty$. Thus, using (10)-(12) and (??) we write
Now, subsequently using (??)-(16), we get
which results in (??). Statement (??) is proved along similar lines: we apply the
expansion directly to the random variable

\[ \frac{X_t}{t - X_t + 1} = \frac{(t+1)(\lambda+1)}{(\lambda t + \lambda + 1)\left(1 - Z_t \frac{\sqrt{\lambda t}}{\lambda t + \lambda + 1}\right)} - 1, \]

and then use the Chernoff bounds and the CLT to obtain the result.


A3. Proof of Lemma 3

Proof. Assume the arrival at time $t + 1$ is of type T1. This occurs w.p. $\frac{\lambda}{\lambda+1}$, and
then a new component of size one is created in $G_{t+1}$, corresponding
to the first case of Lemma 3.

Next, consider a T2 arrival, which occurs w.p. $\frac{p}{\lambda+1}$. We now add a node to
an existing component $C_i$ w.p. $\frac{|C_i|}{|V_t|}$. Thus the probability that we add the new
node to $C_i$ is $\frac{p}{\lambda+1} \cdot \frac{|C_i|}{|V_t|}$.

Last, we consider a T3 arrival. In this case we have two options. The new
edge can either join two components, or join two nodes that are already in one
component. For the first case, we derive the probability that $C_i$ and $C_j$ join as

\[ P(C_i \text{ and } C_j \text{ merge}) = \frac{1-p}{\lambda+1} \cdot \frac{2|C_i||C_j|}{|V_t|^2 - |V_t|}. \]

Then for the second case, the number of ways a T3 arrival links two nodes that
are already connected in a component, say $C_i$, is $|C_i|(|C_i| - 1)$. Therefore, with
probability $\frac{1-p}{\lambda+1} \cdot \frac{\sum_{k=1}^{m} |C_k|(|C_k| - 1)}{|V_t|^2 - |V_t|}$ the component sizes do not change.



























